Abstract. In this article we tackle nonlinear ill-posed inverse problems from
a statistical point of view. We discuss the problem of estimating the unobserved function,
without prior knowledge of its regularity, based on noisy observations over subspaces of
increasing complexity. We consider two estimators: a model selection type estimator for the
nested and the non-nested cases, and a regularized adaptive estimator. In both cases, we
prove consistency of the estimators and give their rates of convergence.



1. Introduction

The term inverse problem is used to denote a wide area of problems arising from both pure 
and applied mathematics. It deals with situations where one must recover information about 
a quantity or a phenomenon under study, from measurements which are indirect and possibly 
noisy as well. Intuitively, an inverse problem corresponds to deriving a cause, given some of 
its effects. Problems of this kind confront practical scientists on a daily basis.

In this article we are interested in recovering an unobservable signal $x_0$ based on observations
(1.1) $y(t_i) = F(x_0)(t_i) + \varepsilon_i$,

where $F : X \to Y$ is a nonlinear functional, with $X$, $Y$ Hilbert spaces, $t_i$, $i = 1, \dots, n$, is
a fixed observation scheme, and $x_0 : \mathbb{R} \to \mathbb{R}$ is the unknown function to be recovered from the
data $y(t_i)$, $i = 1, \dots, n$. The regularity condition over the unknown parameter of interest is
expressed through the assumption $x_0 \in X$. We assume that the observations $y(t_i) \in \mathbb{R}$ and
that the observation noise $\varepsilon_i$ are i.i.d. realizations of a certain random variable $\varepsilon$. Throughout
the paper, we shall denote $y = (y(t_i))_{i=1}^n$. We assume $F$ is Fréchet differentiable and ill-posed
in the sense that our noise-corrupted observations might lead to large deviations when trying
to estimate $x_0$. In a deterministic framework, the statistical model (1.1) is formulated as the
problem of approximating the solution of

$F(x) = y$,

when $y$ is not known, but is only available through an approximation $y^\delta$,

$\|y - y^\delta\| \le \delta$.

It is important to remark that whereas in this case consistency of the estimators depends on
the approximation parameter $\delta$, in (1.1) it depends on the number of observations $n$.
In general, in the linear case, the best $L^2$ approximation of $x$, namely $x^+ = F^+ y$, where $F^+$
is the Moore-Penrose (generalized) inverse of $F$, does not depend continuously on the
data $y$. In particular, for the ill-posed case, unboundedness of $F^+$ entails that $F^+(y^\delta)$
is not close to $x^+$. Hence, the inverse operator needs to be, in some sense, regularized. In
the nonlinear case, ill-posedness of the initial problem will always mean that the
solutions do not depend continuously on the data, but it is not so clear how to find a criterion to
decide whether a nonlinear problem is ill-posed or not. For instance, if $F$ is a compact operator,
local injectivity around a solution $x^+$ of $F(x^+) = y$ implies ill-posedness of the problem,
as noted in [?]. Regularization methods replace an ill-posed problem by a family of
well-posed problems. Their solutions, called regularized solutions, are used as approximations
to the desired solution of the inverse problem. These methods always involve some parameter
measuring the closeness of the regularized and the original (unregularized) inverse problem.
Rules (and algorithms) for the choice of these regularization parameters as well as convergence
properties of the regularized solutions are central points in the theory of these methods, since
they allow one to find the right balance between stability and accuracy.

When $F$ is linear, the statistical problem has been extensively studied, although in general
efficient parameter choice is still under active research. Two main types of estimators
have been considered: first, regularized estimators such as Tikhonov type estimators; second,
nonlinear thresholded estimators. The first approach has been studied in great detail. An
interesting early survey of this topic is provided by O'Sullivan in [?]. In this setting, the 
main issues are what kind of regularizing functional should be considered and closely related 
what the relative weight of the regularizing functional should be. More recently, Mair and 
Ruymgaart in [?] studied different regularized inverse problems and proved the optimality 
of the rate of convergence for their estimators. Special attention has been devoted in this 
setting to the singular value decomposition (SVD) of the operator $F$. We cite the
recent work in this direction developed by Cavalier and Tsybakov in [?] or Cavalier, Golubev,
Picard and Tsybakov in [?]. The second approach has its most popular version in
the wavelet-vaguelette decomposition introduced by Donoho [?]. In this case the main issue is
finding an appropriate basis over which F + , the generalized inverse, is almost diagonal. This 
idea is further developed by Kalifa and Mallat [?] who introduce mirror wavelets. Closely 
related, Cohen, Hoffmann and Reiss in [?] construct an adaptive thresholded estimator based 
on Galerkin's method. 

However, scarce statistical literature exists when F is non linear. Among the few papers 
available, we point out the works [?] or [?] where some rates are given. A different type of 
approach is developed in Chow and Khasminskii [?] for dynamical inverse problems. Moreover 
Bissantz et al. in [?] discuss in their work a nonlinear version of the method of regularization 
(MOR). In practical situations, a linearization of the inverse operator is often performed, 
though the effects of this linearization are seldom studied. In our paper, under certain as- 
sumptions controlling the nonlinearity of the operator, we are able to give precise rates of 
convergence. 

Yet, nonlinear inverse problems are very common in practice, since they often arise when 
studying the solution of a noisy differential equation. Optical or diffuse tomography refers to 
the use of low-energy probes to obtain images of highly scattering media. The inverse problem 
for one of the earliest models of optical tomography amounts to reconstructing the one-step
transition probability matrix, defined by a system of nonlinear equations, see for instance
the work of Grünbaum in [?]. The issue of tomography is also often tackled in a Bayesian
framework (see for instance [?]). However this method requires a large number of regularized
inversions of the operator and a strong prior on the data, while a frequentist adaptive
method can solve this issue in a more efficient way. In electromagnetics, the observations
also follow this framework, see [?]. Moreover, a large class of problems in economics are related to
price variations under constraints, which can also be modelled using differential equations. This
is also the case in econometrics when studying statistical issues with heterogeneity. Adding
an instrumental variable is equivalent to transforming the unknown direct problem into a
nonlinear inverse problem, see for instance the work of Darolles, Florens and Renault in [?].
We must remark, however, that in the deterministic case, this problem has been tackled. The 
authors use mainly L 2 regularized Tikhonov type estimators and show that they provide a 
stable method for approximating the solutions of a nonlinear ill-posed inverse problem. For 
general references, we refer to the work of Engl, Hanke and Neubauer in [?], [?], Tautenhahn 
in [?] or Tikhonov and Leonov in [?] . The choice of the regularization sequence is here crucial. 
On the one hand, this issue is often practically solved by numerical methodology which relies 
on grid methods and recursive algorithms, see for instance [?], [?], [?] or [?]. This point of 
view is close to cross-validation methods in a statistical framework, see also [?]. On the other 
hand, a priori optimal choices of the smoothing sequence are given in the work of [?], leading 
the way to an adaptive estimation of x. Such techniques, as well as model selection techniques 
can be used in a random framework. 

Our goal in this article is to estimate the parameter of interest $x_0$, when $F$ is an ill-posed
but known operator. We aim at using complexity regularization methods to construct this 
estimator. We will deal with a large class of operators, linear but also non linear operators 
that still undergo some assumptions that will be made precise later. Moreover, we want this 
estimator to achieve optimal rates of convergence when the smoothness of the true solution 
is not known a priori. From the statistical literature ([?], [?]) it is clear that we have to ask 
for some kind of penalization in order to obtain satisfying results when estimating the inverse 
problem. 

Hence we consider penalized M-estimators minimizing quantities of the form
(1.2) $\hat{x}_n = \arg\min_{x \in X} \left( \gamma(y - F(x)) + \alpha_n\,\mathrm{pen}(x, X) \right)$,

where $X$ is a specific set, $\gamma(\cdot)$ is a loss function, $\mathrm{pen}(\cdot,\cdot)$ is a penalty over $x$ and/or $X$, and
$\alpha_n$ is a decreasing sequence, all of which will be defined precisely later. The idea of penalized
M-estimators is to find an estimator close enough to the data, close in the sense defined by
$\gamma$, and with a regularity property induced by the choice of the penalty pen. The smoothing
sequence $\alpha_n$ balances the two terms. The greater $\alpha_n$, the smoother the estimator will be,
while the smaller $\alpha_n$, the closer the estimator will be to the data, possibly leading to too
rough an estimate.
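
To make the role of the smoothing sequence concrete, here is a minimal sketch of the simplest instance of a penalized criterion of this form: a linear operator, a squared-error loss and the quadratic penalty $\|x\|^2$. The toy integration matrix `A`, the noise level and the two values of `alpha` are illustrative assumptions, not the estimator studied in this paper. With this particular penalty, "smoother" should be read as "more strongly shrunk towards zero".

```python
import numpy as np


def tikhonov_estimate(A, y, alpha):
    """Minimize ||y - A x||^2 + alpha * ||x||^2 using the closed-form solution."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(d), A.T @ y)


def residual(A, y, x):
    """Euclidean distance between the data and the fitted values."""
    return float(np.linalg.norm(y - A @ x))


# Hypothetical smoothing operator: discrete integration on a regular grid.
n = 50
A = np.tril(np.ones((n, n))) / n
rng = np.random.default_rng(0)
x_true = np.sin(np.linspace(0.0, np.pi, n))
y = A @ x_true + 0.01 * rng.standard_normal(n)

x_rough = tikhonov_estimate(A, y, alpha=1e-8)   # almost unregularized: follows the data
x_smooth = tikhonov_estimate(A, y, alpha=1e-1)  # heavily penalized: shrunk estimate
```

Comparing `residual(A, y, x_rough)` with `residual(A, y, x_smooth)`, and the norms of the two estimates, shows the trade-off described above: the small `alpha` fits the data more closely, the large one yields a more strongly regularized solution.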

Two main types of penalties are considered: dimension penalties that will lead to model 
selection type estimators and regularity penalties which will lead to regularized estimators. 
• In the first case the idea, as developed in [?], consists in looking at a sequence of
subspaces of increasing complexity (for now complexity will be defined in terms of 
dimension and we shall look at a sequence of linear subspaces to be defined later). 
This is usually considered a discretization method. The estimator is a projection 
estimator and the choice of the set is defined by the penalty. 

• In the second case, a regularization functional pen(x, X) = J(x) (typically a quadratic 
functional) is introduced and appropriate weights are considered for this term. Actu- 
ally, this method is known as Tikhonov regularization when the penalty is quadratic, 
see for instance [?]. 



In both cases, the choice of either the collection of spaces or the smoothing sequence 
determines the behaviour of the estimator. Since we want to construct adaptive estimators, 
we investigate choices that do not depend on an a priori knowledge of the regularity of the 
true function $x_0$. Indeed, if the unknown function $x_0$ is of known regularity it is quite simple
to choose good discretization or regularization schemes a priori: find a subspace or a
regularization parameter such that the error is smaller than a certain prescribed threshold.
However, usually this is not possible since the smoothness of the solution is unknown, and
adaptive methods must be used. Adaptivity means here that the construction of the estimator
does not require knowing beforehand the regularity of the function of interest $x_0$ to be
recovered. In the inverse problems literature these are known as a posteriori methods. But, we do
assume that the inverse operator is known as well as some assumptions, such as its degree of 
ill-posedness. 

In model (1.1) the assumptions entail that the variance grows with the complexity: this is
an important difference with standard numerical analysis. Indeed, this difference yields other 
optimal rates which are usual in statistics ([?], [?]) and which we will give below. 

Finally, what is the price to pay here for nonlinearity? Since we will consider a linear
expansion of the operator $F$ in a neighborhood of $x$, the introduction of nonlinearity requires
controlling the linear part of the Fréchet differential of $F$ in balls around the true
solution $x_0$. As opposed to linear problems, this entails the need to find a "good"
initial guess, which we shall denote $x^*$. Moreover, the ill-posedness of the problem requires
relating the nonlinearity to the smoothing properties of $F'(x_0)$, see condition AF below. We
remark that this kind of condition is at the heart of probabilistic control of noise amplification.
We show, that, under such restrictions, the nonlinearity of the problem does not change the 
rate of convergence and we still are able to build adaptive estimators. 

The article is divided into five main parts. In Section 2 we describe our general framework.
In Section 3 we discuss discretization methods (projection estimators) and find optimal rates
for ordered and non-ordered selection. In Section 4 we tackle regularization methods: we
prove optimality of an adaptive Tikhonov-like estimator. Section 5 is devoted to technical
lemmas which will be useful in the paper.
2. Presentation of the problem

In this section we introduce general notation and assumptions. These include standard 
concentration assumptions over the observation noise and some restrictions over the class of 
operators F(.). 

2.1. General assumptions. 

Recall that we want to estimate a function $x_0 : \mathbb{R} \to \mathbb{R}$. It is important to stress that
the observations depend on a fixed design $(t_1, \dots, t_n) \in \mathbb{R}^n$. This will require introducing an
empirical norm based on this design. Set $Q_n$ to be the empirical measure of the covariables:

$Q_n = \frac{1}{n} \sum_{i=1}^n \delta_{t_i}$.

Here $\delta$ denotes the Dirac measure. The $L^2(Q_n)$-norm of a function $y \in Y$ is then given
by

$\|y\|_n = \left( \int y^2 \, dQ_n \right)^{1/2}$,

and the empirical scalar product by

$\langle y, \varepsilon \rangle_n = \frac{1}{n} \sum_{i=1}^n \varepsilon_i \, y(t_i)$.
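
As an illustration, the empirical scalar product and norm reduce to plain averages over the design points. The sketch below assumes that functions are represented by their values at $t_1, \dots, t_n$; the helper names `empirical_inner` and `empirical_norm` are hypothetical.

```python
import numpy as np


def empirical_inner(u, v):
    """<u, v>_n = (1/n) * sum_i u(t_i) * v(t_i), for values sampled on the design."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.mean(u * v))


def empirical_norm(u):
    """||u||_n = sqrt(<u, u>_n)."""
    return float(np.sqrt(empirical_inner(u, u)))


# Example: the constant function 2 sampled on a five-point design.
t = np.linspace(0.0, 1.0, 5)
y_vals = 2.0 * np.ones_like(t)
norm_of_constant = empirical_norm(y_vals)
```

A constant function keeps its absolute value as empirical norm, whatever the design, which is a quick sanity check of the $1/n$ normalization.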

Note that this empirical norm is defined over the observation space $Y$. Over the solution space
$X$ we will consider the norm given by the Hilbert space structure. For the sake of simplicity,
we will write $\|\cdot\|_X = \|\cdot\|$ when no confusion is possible.
We also introduce certain standard assumptions on the observation noise.

AN moment condition for the errors:

$\varepsilon$ is a centered random variable satisfying the moment condition $E(|\varepsilon|^p/\sigma^p) \le p!/2$
and $E(\varepsilon^2) = \sigma^2$.

In the nonlinear case, our convergence analysis will be a local one, hence it is necessary to
start with an initial guess of the solution. We require that this starting point $x^*$ satisfies the
following conditions:

• This initial guess should allow one to construct a good approximation of the unknown
Fréchet derivative of the operator evaluated at $x_0$: $F'(x_0)$. For this, assume $F$ is
Fréchet differentiable and the range of

$DF(x_1, x_2) = \int_0^1 F'(x_1\theta + x_2(1 - \theta)) \, d\theta$

remains unchanged in a neighbourhood of the guess solution $x^*$, i.e. in the ball $B_\rho(x^*) =
\{x, \|x - x^*\| \le \rho\}$, for a certain $\rho > 0$. More precisely, we assume, using the same
assumption as in [?],
AF control over the nonlinear part of the differential operator:

There are $c_T$, a fixed linear operator $T$ (generally $T = F'(x^*)$) and a linear
operator depending on $x$ and $x'$, written $R(x,x')$, such that for $x, x' \in B_\rho(x^*)$

$F(x) - F(x') = T R(x, x')(x - x')$,

with $\|I - R\| \le c_T$.

Hence, $T$ is a known bounded linear operator that can be seen as some approximation
to $F'(x)$ in a neighbourhood $B_\rho(x^*)$, which we must be able to use in our computations.
Note also that, in contrast, provided a bound holds for $\|I - R\|$, the linear
operator $R$ need not be known explicitly. In Section 2.4 we discuss such drastic
restrictions for the operators and provide examples satisfying these assumptions.
• Assume also that the images by the linear operator $T$ of $x^*$ and $x_0$ are close, in the
sense that

IG identifiability condition:
$x_0 - x^* \in \mathrm{Ker}(T)^\perp$,

where $\mathrm{Ker}(T)^\perp$ is the orthogonal complement of the null space of the operator $T$.
The following approximation result ([?]) ensures uniqueness of the sought solution if
the initial guess is sufficiently close.

Lemma 2.1. Assume AF holds with $c_T < 1/2$ and assume that for $x, x' \in B_\rho(x^*)$, $F(x) =
F(x')$ and $x - x' \in \mathrm{Ker}(T)^\perp$. Then $x = x'$.

This lemma guarantees the identifiability of the estimation problem (1.1) since the
solution is uniquely chosen.

If $x^*$ is such that $x_0 \in B_\rho(x^*)$, the local behaviour of $F$ will be defined by the operator $T$. We
assume the regularity of the problem is defined by that of $F'(x_0)$, hence by its approximation $T$:
this linear operator acts with a degree of ill-posedness defined by an index $p$, which comes
from the fact that the operator is compact and therefore its inverse is not $L^2$ bounded.
This is generally expressed by the fact that $T$ maps $L^2$ into some Sobolev space $H^p$. This
condition is quite natural when studying the ill-posedness of operators, see for instance [?].
That is the reason why we assume $T$ acts along a Hilbert scale $H^s$.
IP ill-posedness of the operator:

There exists $p > 0$ such that $F'(x_0)(H^s) = H^{s+p}$.

This property can be expressed in an equivalent way by the ellipticity property
(2.1) $\langle Tx, x \rangle \sim \|x\|^2_{H^{-p/2}}$,

where $H^{-p/2}$ stands for the dual space of the Sobolev space $H^{p/2}$.
2.2. Approximating subspaces. 

Estimating over all $X$ is in general not possible. We shall thus assume we are equipped
with a sequence $Y_1 \subset Y_2 \subset \dots \subset Y_m \subset \dots \subset Y$ of nested linear subspaces whose union is dense in
$Y$. We assume

$\dim(Y_m) = d_m$.

Denote the projection of $W$ over any subspace $Z$ by $\Pi_Z W$. Let $\Pi^n_{Y_m}$ stand for the projection
in the empirical norm. Set $T_m = \Pi^n_{Y_m} T$, for the linear operator $T$ defined in AF. We
get the following diagram

[Diagram relating $X$, $X_m$ and the operator $T$.]

As we are assuming that over $Y$ we consider the empirical norm, and over $X$ the usual $L^2$
norm, the adjoint operator of $T_m$ with respect to such topology, $T_m^*$, actually depends on the
observation sequence $t_i$. However, we will usually drop this fact from the notation.

For example, if $Y_m$ is generated by some orthonormal basis $\phi = (\phi_1, \dots, \phi_{d_m})$, with respect
to the $L^2$ norm over $Y$, and $T = \mathrm{Id}$, then

$\Pi^n_{Y_m} y = \sum_{j=1}^{d_m} y_{j,n} \, \phi_j$,

where the $y_{j,n} = \langle \Pi^n_{Y_m} y, \phi_j \rangle_n$ are the solution to the projection problem under the empirical
measure $Q_n$. Set $G_m = [\phi_j(t_i)]_{i,j}$ to be the Gram matrix associated to the basis $(\phi_j)_{j \ge 1}$. Thus,

$(y_{j,n})_j = (G_m^t G_m)^{-1} G_m^t \, (y(t_1), \dots, y(t_n))$.
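
The coefficient formula above is an ordinary least-squares problem on the design points. The following minimal sketch assumes a cosine basis and a midpoint design (both hypothetical choices) and solves the normal equations through `numpy.linalg.lstsq`, which is numerically safer than forming $(G_m^t G_m)^{-1}$ explicitly.

```python
import numpy as np


def projection_coefficients(G, y):
    """Return the least-squares coefficients (G^t G)^{-1} G^t y without forming the inverse."""
    coef, *_ = np.linalg.lstsq(G, y, rcond=None)
    return coef


# Hypothetical design and basis: midpoints of [0, 1] and the first cosine functions.
n, d_m = 40, 3
t = (np.arange(n) + 0.5) / n
columns = [np.ones(n)] + [np.sqrt(2.0) * np.cos(np.pi * j * t) for j in range(1, d_m)]
G = np.column_stack(columns)  # G[i, j] = phi_j(t_i)

true_coef = np.array([1.0, -0.5, 0.25])
y = G @ true_coef             # noiseless observations lying in the span of the basis
coef = projection_coefficients(G, y)
```

When the observations already lie in the span of the basis, as here, the projection recovers the generating coefficients exactly (up to rounding).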

Define $X_m = T_m^* Y$. Let $A^+$ stand for the generalized inverse of a closed range operator $A$.
Then, by construction

$\Pi_{X_m} = (\Pi^n_{Y_m} T)^+ \Pi^n_{Y_m} T$.

Our goal is to estimate the unknown $x_0$ by $x_m \in x^* + X_m$ in such a way that $F(x_m)$ is
close to the observed $y$. By assumption IG this amounts to approximating $\mathrm{Ker}(T)^\perp$
by means of the collection $X_m$ in some sense that is adjusted to the observation error.

Define

$\nu_m = \|(\Pi^n_{Y_m} T)^+ \Pi^n_{Y_m}\|$.

This quantity controls the amplification of the observation error over the solution space $X_m$.
We have, [?],

$\nu_m \ge \gamma_m^{-1}, \qquad \gamma_m := \inf_{v \in Y_m, \|v\| = 1} \|T^* v\|$.

The parameter $\gamma_m$ expresses the effect of the operator $T^*$ over the approximating subspace $Y_m$. In the
case of $T$ acting over a Hilbert scale $H^t$ we may assume $T(H^s) = H^{s+p}$ and

(2.2) $\gamma_m = d_m^{-p}$.
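
For intuition, the quantity $\inf_{v \in Y_m, \|v\| = 1} \|T^* v\|$ is the smallest singular value of $T^*$ restricted to $Y_m$. The sketch below is a toy check under explicit assumptions: a diagonal operator with singular values $j^{-p}$, $Y_m$ spanned by the first $m$ coordinate vectors, and Euclidean norms standing in for the empirical norm. In that setting the infimum equals $m^{-p}$, in line with (2.2).

```python
import numpy as np


def gamma_m(T, basis_Ym):
    """Smallest singular value of T^* restricted to span(basis_Ym).

    basis_Ym is assumed to have orthonormal columns; T is a real matrix,
    so its adjoint is the transpose.
    """
    restricted = T.T @ basis_Ym
    return float(np.linalg.svd(restricted, compute_uv=False).min())


# Hypothetical operator with polynomially decaying singular values j^{-p}.
N, p = 30, 2.0
T = np.diag(np.arange(1, N + 1, dtype=float) ** (-p))

m = 5
Ym = np.eye(N)[:, :m]   # first m coordinate directions
g = gamma_m(T, Ym)      # expected to equal m ** (-p) in this toy setting
```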

On the other hand this term is related to the goodness of the approximation scheme. Indeed,
it can be seen [?] that

$\gamma_{m+1} \le \|T^*(I - \Pi_{Y_m})\| \le c \, \|I - \Pi_{Y_m}\|_n := c \, \tilde\gamma_m$

as $\|T^*\|$ is bounded. Here $\tilde\gamma_m$ describes the approximation properties of the sequence $Y_m$. The
next assumption requires that $\gamma_m \sim \tilde\gamma_m$, that is, that there exist constants $c_1, c_2$ such that

$c_2 \le \gamma_m / \tilde\gamma_m \le c_1$.

AS amplification error:

Assume there exists a positive constant $U$ such that $\|I - \Pi_{Y_m}\|_n \le \sqrt{U} \, \gamma_m$.
Remark 2.2. Assumption AS thus establishes that the worst amplification of the error over
$X_m$ is roughly equivalent to the best approximation over $Y_m$. If we think of $Y$ as smoothed by
the action of the operator $T$, the above assumption is quite natural. In the case of an operator
acting over a Hilbert scale, we may assume $Y = H^p$ and then $\gamma_m = d_m^{-p}$.

2.3. Rates of convergence for inverse problems. 

Consider the projection estimator $x_m$ over $X_m \cap B_\rho(x^*)$. That is, $x_m$ is chosen in such a
way as to minimize $\|y - F(x_m)\|_n$. The estimation error can thus be bounded by

(2.3) $\|x_0 - x_m\| \le \|(I - \Pi_{X_m}) x_0\| + \gamma_m^{-1} \left( E\|\Pi_{Y_m}\varepsilon\|_n^2 \right)^{1/2}$.

Since $Y_m$ is a sequence of linear subspaces, $E\|\Pi_{Y_m}\varepsilon\|_n^2 = O(d_m/n)$. Thus rates are of order

$\|(I - \Pi_{X_m}) x_0\| + \frac{d_m^{1/2}}{\sqrt{n} \, \gamma_m}$.

This rate depends on the ill-posedness of the operator and the approximation properties of
$X_m$. In some cases these are known precisely and in others they can be deduced from the
properties of the solution $x_0$. One such case is the following source assumption, encountered
typically in the inverse problems literature.

SC source condition:

There exists $0 < \nu \le 1/2$ such that $x_0 \in \mathrm{Range}((T^*T)^\nu) = \mathcal{R}((T^*T)^\nu)$.

Indeed consider

$A_{\nu,\rho} = \{x \in X, \ x = (T^*T)^\nu \omega, \ \|\omega\| \le \rho\}$,
where $0 < \nu \le \nu_0$, $\nu_0 > 0$, and use the further notation

(2.4) $A_\nu = \bigcup_{\rho > 0} A_{\nu,\rho} = \mathcal{R}((T^*T)^\nu)$.

These sets are usually called source sets; $x \in A_{\nu,\rho}$ is said to have a source representation.
The requirement for an element to be in $A_{\nu,\rho}$ can be considered as a smoothness condition.
Then, following [?] and under SC we have, if $\nu \le 1/2$,

$\|(I - \Pi_{X_m}) x_0\| \le \|I - \Pi_{Y_m}\|_n^{2\nu} = O(d_m^{-2\nu p})$,

for a certain $p$.

This leads to the rate

$\|x_m - x_0\| = O\left( n^{-2\nu p/(4\nu p + 2p + 1)} \right)$.

In the language of the statistical literature this rate corresponds to $s = 2\nu p$: the regularity depends on
the ill-posedness of the problem. In the ill-posed literature the error is not related to the
underlying dimension so that rates are different. Typically in a Hilbert scale setting, if the
true solution $x_0 \in H^s$, optimal rates are of order $O(n^{-s/(2s+2p+1)})$, see for example [?].
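
The exponent in this rate comes from balancing a bias of order $d_m^{-2\nu p}$ against a stochastic term of order $d_m^{p+1/2}/\sqrt{n}$, which gives $d_m \sim n^{1/(4\nu p + 2p + 1)}$. The short sketch below only performs this arithmetic; the function names are hypothetical and the numerical values of $\nu$ and $p$ are arbitrary examples.

```python
def balancing_dimension_exponent(nu, p):
    """Exponent b such that d_m ~ n**b equates d^(-2*nu*p) and d^(p + 1/2) / sqrt(n)."""
    return 1.0 / (4.0 * nu * p + 2.0 * p + 1.0)


def rate_exponent(nu, p):
    """Exponent a such that the resulting error is of order n**(-a)."""
    return 2.0 * nu * p / (4.0 * nu * p + 2.0 * p + 1.0)


def hilbert_scale_exponent(s, p):
    """Exponent s / (2s + 2p + 1) for a solution of smoothness s and ill-posedness p."""
    return s / (2.0 * s + 2.0 * p + 1.0)


# Example values (arbitrary): nu = 1/4 and p = 2 correspond to s = 2 * nu * p = 1.
nu, p = 0.25, 2.0
s = 2.0 * nu * p
a_source = rate_exponent(nu, p)
a_scale = hilbert_scale_exponent(s, p)
```

Both parametrizations give the same exponent once $s = 2\nu p$ is substituted, which is the identification made in the text.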
2.4. Comments on the assumptions.

We impose severe conditions [IP] and [AF] for the inverse operator. 

First of all, we note that this paper handles the case of linear operators since the assump- 
tions are fulfilled with R = Id and x* = 0. All the results are valid without any further 
assumptions. 

For the nonlinear case, we cannot expect a general rate result without such a condition,
which measures the difference between the operator and its linear counterpart, i.e. when $T =
F'(x^*)$, in terms related to the ill-posedness of the operator.

The nature of the condition we impose comes from the fact that the rates of convergence are
drawn from the source condition [SC]. Hence, in the nonlinear case, we need to impose, as
noted in [?], that $F'(x^*)^*(y - F(x^*)) \in \mathcal{R}(F'(x_0)^*)$. A necessary condition is given by

$\mathcal{R}(F'(x^*)^*) \subset \mathcal{R}(F'(x_0)^*)$.

This condition ensures we can find a linear operator $R(x,x^*)$, with $\|\mathrm{Id} - R\| \le c_T$ for some
constant $c_T$, such that for all $x \in B_\rho(x^*)$ we have

$F'(x) = F'(x_0) R(x,x^*)$.

Other more standard assumptions in the statistics literature for inverse problems, such as
$F'(\cdot)$ being Lipschitz, are not well suited to deal with ill-posed problems.
As a matter of fact, as stated in [?] or [?], the usual assumptions such as a Lipschitz condition
(with constant $L$) on the Fréchet derivative

$\|F(x) - F(x_0) - F'(x)(x - x_0)\| \le \frac{L}{2} \|x - x_0\|^2$,

implying an estimate of the first-order Taylor remainder, are not appropriate in this case.
That is the reason why we require range invariance conditions of the type

$F(x) - F(x') = T R(x,x')(x - x'), \quad \forall x, x' \in B_\rho(x^*)$,
$\|R - \mathrm{Id}\| \le c$.

This means, as noted before, that the range of the divided difference operators $DF(x, x')$ in

$F(x) - F(x') = DF(x, x')(x - x')$
remains unchanged in $B_\rho(x^*)$.

Such a condition has been verified for many applications, see for instance [?], [?], [?] or [?]. The
alternative assumption, also often used in the literature on the numerical solution of nonlinear
inverse problems, is called the tangential cone condition

$\|F(x) - F(x') - T(x - x')\| \le c \, \|T(x - x')\|, \quad \forall x, x' \in B_\rho(x^*)$.

Previous authors point out that the assumption we have chosen in this paper plays a role in
certain parameter identification problems from boundary measurements where the tangential
cone condition might be hard or impossible to verify.
As an example consider the nonlinear Hammerstein inverse problem. The operator $F$ is
given by:

$F : H^1[0, 1] \to L^2[0, 1], \quad F(x)(t) = \int_0^t \phi(x(s)) \, ds$.
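
A discretized version of such an operator is easy to write down. The sketch below assumes the usual Hammerstein form $F(x)(t) = \int_0^t \phi(x(s))\,ds$ (an assumption here, since the formula is only partly legible in the source), a regular grid on $[0,1]$, and the hypothetical choice $\phi(u) = u^2$; the integral is approximated with a cumulative trapezoidal rule.

```python
import numpy as np


def hammerstein(x_vals, t, phi):
    """Approximate F(x)(t_k) = int_0^{t_k} phi(x(s)) ds with the cumulative trapezoidal rule."""
    integrand = phi(x_vals)
    steps = np.diff(t)
    increments = 0.5 * steps * (integrand[1:] + integrand[:-1])
    return np.concatenate(([0.0], np.cumsum(increments)))


# Hypothetical example: x(s) = s and phi(u) = u^2, so that F(x)(t) = t^3 / 3.
t = np.linspace(0.0, 1.0, 1001)
x_vals = t.copy()
Fx = hammerstein(x_vals, t, phi=np.square)
max_error = float(np.max(np.abs(Fx - t ** 3 / 3.0)))
```

The map is nonlinear in `x_vals` through `phi`, while the integration step is the smoothing part responsible for the ill-posedness of the inverse problem.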

If $\phi$ is assumed to belong to $C^{2,1}(I)$ for all intervals $I \subset \mathbb{R}$, Hanke et al. in [?] prove that such an
inverse problem is ill-posed and that the operator satisfies the previous restrictions. In [?],
the assumptions are shown to be satisfied by the inverse groundwater filtration problem of identifying the
transmissivity $a$ in

(2.5) $-\nabla \cdot (a \nabla u) = f$ in $\Omega$,

$u = g$ on $\partial\Omega$,

on a $C^2$ domain $\Omega \subset \mathbb{R}^3$, from measurements of the piezometric head $u$, where $F$ is the
parameter-to-solution map $F : a \mapsto u$.

The approximation properties are quite natural having in mind Sobolev spaces of different
order and piecewise polynomial approximations, as highlighted in [?]. There are some
examples satisfying such conditions. For instance, in this previous work, it is also proved
that problem (2.5) fulfills these assumptions with $p = 2$, since the linearization $T$ of the
operator $F$ acts in as smoothing a way as integrating twice.

3. Complexity regularization 

3.1. Ordered selection. 

Consider to begin with that $Y_m$, $m \in M_n$, is a sequence of nested subspaces. We define
ordered selection as the problem of choosing the best $m$ based on the observations. For this
we will construct penalized estimators that require finding the first $m$ that minimizes

$\|\Pi^n_{Y_m}(y - F(x_m))\|_n^2 + \mathrm{pen}(m)$,

where $\mathrm{pen}(m)$ is an increasing function. From a deterministic point of view this is essentially
equivalent to choosing $m$ based on the discrepancy principle (see [?] for an application of the
discrepancy principle to nonlinear problems); however, the fact that the error does not have
finite energy and that the goodness of fit depends on the number of observations introduces
important changes both in the methods of proof and in the definition of the estimator.
More precisely, define the estimator

(3.1) $\hat{x}_{\hat{m}} = \arg\min_{m \in M_n} \ \arg\min_{x \in x^* + X_m, \ x \in B_\rho(x^*)} \|\Pi^n_{Y_m}(y - F(x))\|_n^2 + \mathrm{pen}(m)$,

where $\mathrm{pen}(m) \ge r(1 + L)\sigma^2 d_m/n$, $r = 2 + \theta$, for some $\theta > 0$, and $L > 0$.

Numerically, minimization in the above expression is more complicated than it would be 
in the linear case because we must calculate the projection matrix at each step. However, 
choosing an efficient sampling scheme will do the job. 
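
To illustrate how a dimension penalty of the form $r(1+L)\sigma^2 d_m/n$ drives ordered selection, here is a minimal sketch in the simplest direct setting ($T = \mathrm{Id}$, linear models). It uses the standard penalized least-squares criterion $\|y - \Pi_{Y_m} y\|_n^2 + \mathrm{pen}(m)$, which is a simplification and not the estimator (3.1) itself; the cosine design, the noiseless data and the values of `r`, `L` and `sigma2` are assumptions made for the example.

```python
import numpy as np


def select_dimension(y, design, dims, sigma2, r=2.5, L=0.5):
    """Return the first dimension d minimizing ||y - Pi_d y||_n^2 + r (1 + L) sigma2 d / n.

    Pi_d is the least-squares projection on the first d columns of `design`.
    """
    n = y.size
    best_d, best_crit = None, np.inf
    for d in dims:
        G = design[:, :d]
        coef, *_ = np.linalg.lstsq(G, y, rcond=None)
        fit = np.mean((y - G @ coef) ** 2)            # squared empirical norm of the residual
        crit = fit + r * (1.0 + L) * sigma2 * d / n   # goodness of fit plus dimension penalty
        if crit < best_crit:                          # strict inequality keeps the first minimizer
            best_d, best_crit = d, crit
    return best_d


# Hypothetical nested models: spans of the first d cosine functions on a midpoint design.
n = 64
t = (np.arange(n) + 0.5) / n
columns = [np.ones(n)] + [np.sqrt(2.0) * np.cos(np.pi * j * t) for j in range(1, 10)]
design = np.column_stack(columns)

# Noiseless signal living in the third model, to keep the example deterministic.
y = design[:, :3] @ np.array([2.0, 1.0, 0.5])
d_hat = select_dimension(y, design, dims=range(1, 11), sigma2=0.01)
```

Smaller models pay a large lack-of-fit term, larger ones only add penalty, so the criterion settles on the smallest model containing the signal. In the nonlinear case each candidate $m$ would additionally require an inner minimization over $x \in x^* + X_m$.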

The next theorem says the above estimator is also efficient in terms of the rates in equation
(2.3), up to a constant. As a matter of fact, the model selection estimator has a rate
of convergence no worse than the best rate achieved by the best estimator for a selected
model.

Theorem 3.1. There exist constants C(r,cr), and k(r,a) such that with probability greater 
thanl-2e- kum+1)) 

(3.2) \\x m -x \\ 2 <C(r,a) inf - n Xm )x \\ 2 + + - 

Proof: For each $m$, assume $x_m = x^* + z_m$, $z_m \in X_m$, and $x_m \in B_\rho(x^*)$. Set

$w(x_m) = \frac{\Pi^n_{Y_m}(F(x_m) - F(x_0))}{\|\Pi^n_{Y_m}(F(x_m) - F(x_0))\|_n}$.



We divide the proof in a series of steps. 

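• Control of $d_{\hat m}$. Recall the penalization is defined by $\mathrm{pen}(m) = r(1 + L)\sigma^2 [d_m + 1]/n$,
with $2 < r$ a certain constant.

Following standard arguments we have

$\|\Pi^n_{Y_{\hat m}}(F(\hat x_{\hat m}) - F(x_0))\|_n^2 + \mathrm{pen}(\hat m)$
$\le \|\Pi^n_{Y_m}(F(x_m) - F(x_0))\|_n^2 + 2\langle \Pi^n_{Y_{\hat m}}(F(\hat x_{\hat m}) - F(x_0)), \varepsilon \rangle_n$
$\quad - 2\langle \Pi^n_{Y_m}(F(x_m) - F(x_0)), \varepsilon \rangle_n - \|\Pi^n_{Y_{\hat m}} \varepsilon\|_n^2 + \|\Pi^n_{Y_m} \varepsilon\|_n^2 + \mathrm{pen}(m)$.

Let $0 < \kappa < 1$. Since $2ab \le \kappa a^2 + \frac{1}{\kappa} b^2$ for any $a, b$, we have for any $m$ and $x_m \in X_m$

$2\langle \Pi^n_{Y_m}(F(x_m) - F(x_0)), \varepsilon \rangle_n \le \kappa \|\Pi^n_{Y_m}(F(x_m) - F(x_0))\|_n^2 + \frac{1}{\kappa} |\langle w(x_m), \varepsilon \rangle_n|^2$.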

Set for x > 0, t(m) = c[d m + 1] + (1 + e)~ 1 x and assume k and c are chosen in 
such a way that -((1 + g) + (1 + l/g)c) — C\ < r(l + L). Remark, k can be chosen 
very close to one. 

Thus, 

2 , / it , <i\ ^ 2 Mm+ 1] 



(1 - k) llfly^ (F(x m ) - F(x ))\\l + (r(L + 1) - Cl ) 



2 1 I / ~ \ ^ 1 2 ^([c^ + l]) 



< (1 + «)||n ym (F(x m ) - F(x ))|| 2 + -| < £> n I - Cl 



|n y .d| 2 + |in y £ 112 

I J- m I In 1 II I m 



k n 

rfi^ Mn 1 II "^m 4 -" II n 

+ - < w(x m ), e > n 2 - Cl ^^ ^ + r L + 1 + ci)— ^ ±. 

k n n 

Now since k < 1 and ci = Ci(e) < r(L + 1), we have for fixed e > 0, 
CJ 2 {drn + 1)



n 



< 



+ 



r — C\ 
2 



{\\F{x m ) - F{x ))\\ 2 n + pen{m)) 

2 ci ,a 2 [d m + 1] 

SUp (| < U m , e > n | - — K{- 



)). 



«(r - ci) m ,|| Mm || n= i 2 n 

for u m G S'm a countable dense subset of X m . Note that using Lemma lo~Tl 



sup - < u m ,e > n = c\\n$ m e\\ n . 

||Um|[„=l V K / 

We will note for any square matrix A 

p 2 (A) = max eigenvalue^' A). 

Also note that, 

pWX) = 1 and Tr ( n L fn L) = ^- 

Hence, by Lemma 5.3 there are constants $d$ and $c_2$ such that



P 



sup sup ( I < u m ,e > n \ 2 — ci/2ac( 

m ||u m ||=l 



<7 2 [d m + l]. 



> - 

n 



< VP( sup |< Mm ,e> n | 2 -c 1 /2/ t ( a2[dm + 1] )> 
ll« II -i n 



< ^exp{- v / C /(x + c 2 L[rf m + 1])} < Cse-v 7 ^ 2 , 



setting C 2 = J2 m e-^/ C2L[dm+1]/2 . 

So that with probability greater than 1 — C 2 e~^ dx ^ 2 we have 

a 2 [drn + l] 



n 



< inf inf 

m x 



-— A (||F(x m ) - F(x ))|| 2 + pen(m)) + *. 

er*+z m r(l + L) — ci n 



On the other hand, for any given $m$, let $x_m$ stand for the "projection" of $x_0$ over
$x^* + X_m$, i.e. $x_m = x^* + z_m$ is such that

$\Pi^n_{Y_m} F(x_m) = \Pi^n_{Y_m} F(x_0)$.

Let K 2 = Pr 2 - for Ct defined in AF. We have the following lemma, the proof of 
which is exactly as that of Lemma 2 in [?] . 

Lemma 3.2. Assume $K_2$ and $x^*$ are such that

$K_2 \|[I - \Pi_{X_m}](x_0 - x^*)\| + \|\Pi_{X_m}(x_0 - x^*)\| \le \rho$.

Then there exists x G x* + X m , such that is satisfied. 
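$\Pi^n_{Y_m}\big(F(x_m) - F(x_0) - T(x_m - x_0) + T(x_m - x_0)\big) = 0$,
so that, since $(\Pi^n_{Y_m} T)^+ \Pi^n_{Y_m} T = \Pi_{X_m}$,

$\Pi_{X_m}(x_m - x_0) = -(\Pi^n_{Y_m} T)^+ \Pi^n_{Y_m} T (R(x_m, x_0) - I)(x_m - x_0) = -\Pi_{X_m}(R - I)(x_m - x_0)$.

On the other hand,

$(I - \Pi_{X_m})(x_m - x_0) = (I - \Pi_{X_m})(x^* - x_0)$,

and therefore, by condition [AF],

$\|x_m - x_0\| \le \frac{1}{1 - c_T} \|(I - \Pi_{X_m})(x^* - x_0)\|$

and

$\|F(x_m) - F(x_0)\|_n = \|(I - \Pi^n_{Y_m})(F(x_m) - F(x_0))\|_n$
$\le \tilde\gamma_m (1 + c_T) \|x_m - x_0\|$
$\le \tilde\gamma_m \frac{1 + c_T}{1 - c_T} \|(I - \Pi_{X_m})(x^* - x_0)\|$,

where $\tilde\gamma_m = \sup_x \|(I - \Pi^n_{Y_m}) T x\|_n / \|x\|$.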

Now let $d_{m_{opt}}$ be such that

$d_{m_{opt}} = \arg\min \Big[ (\tilde\gamma_m)^2 \Big( \frac{1 + c_T}{1 - c_T} \Big)^2 \|(I - \Pi_{X_m})(x^* - x_0)\|^2 + \mathrm{pen}(m) \Big]$.

Since we are looking at ordered selection,

$g(m) = \tilde\gamma_m \frac{1 + c_T}{1 - c_T} \|(I - \Pi_{X_m})(x^* - x_0)\|$

is a decreasing sequence, so that the minimizer must be such that $g(m) = \mathrm{pen}(m)$.
Hence we have 



P 



4rfl + L) 

( d ™ + 1 - r(l +M_ Cl (^ + 1))+ > « 



and dtfj < ^i^p~ (^m opt + 1) — 1 + w with probability greater than 1 — C 2 e V du / 2 . 
Error bounds: Set 

A = \\F( Xlh ) - F(x ))\\l - inf inf -^-{\\W Y (F{x m ) - F(x ))\\l + pen(m)) + . 



m x m \ — k 



Lemma 5.3 also yields



P(A > x/n) < C 2 e~ dx/2 . 
We are now able to prove optimal rates for our estimator. For this we need to bound
$\|\Pi^n_{Y_{\hat m}}(F(\hat x_{\hat m}) - F(x_0))\|_n$ from below.

Let $\Omega_{dim}(u)$ be the set such that $d_{\hat m} \le \frac{4r(1+L)}{r(1+L) - c_1} [d_{m_{opt}} + 1] + u - 1$. Let $\Omega_{fit}(u)$ be
the set where $A \le u$. In this section we assume we are always in $\Omega_{dim}(u) \cap \Omega_{fit}(u)$.
We require the following lemma.

Lemma 3.3. Let $x \in \mathcal{R}(T^*T) + x^*$. There exists a constant $C$ such that

$\|\Pi_{Y_m}(F(x) - F(x^* + \Pi_{X_m}(x - x^*)))\| \le C(1 + c_T) \, \gamma_m \, \|(I - \Pi_{X_m})(x - x^*)\|$.

Proof. Let $x - x^* = T^* y$, with $y \in \mathcal{R}(T)$. By definition

$(I - \Pi_{X_m}) x = T^*(I - \Pi_{Y_m}) y = T^* w$,

with $w \in Y_m^\perp$. Then

$\|\Pi_{Y_m}(F(x) - F(x^* + \Pi_{X_m}(x - x^*)))\|$
$\le \|\Pi^n_{Y_m} T R \, ((x - x^*) - \Pi_{X_m}(x - x^*))\|$
$= \|\Pi^n_{Y_m} T R \, (I - \Pi_{X_m})(x - x^*)\|$
$\le \sup_{w \in Y_m^\perp, \|w\| = 1} \|\Pi^n_{Y_m} T R T^* w\|_n \, \|(I - \Pi_{X_m})(x - x^*)\|$.

The first term in the latter is in turn bounded by

$\|\Pi^n_{Y_m} T R\| \sup_{w \in Y_m^\perp, \|w\| = 1} \|T^*(I - \Pi_{Y_m}) w\|$
$\le C(1 + c) \sup_{x \in \ker(T)^\perp, \|x\| = 1} \|(I - \Pi_{Y_m}) T x\|_n$,

since $T^*$ is the adjoint operator of $T$. □



With this Lemma we have, for $x_m = x^* + z_m$, $z_m \in X_m$,
\begin{align*}
\|\Pi_{Y_m}(F(x_m) - F(x^* + \Pi_{X_m}(x_0 - x^*)))\|^2 \ge{}& \|\Pi_{Y_m}TR\,\Pi_{X_m}(z_m - (x_0 - x^*))\|^2 \\
& - [C(1+c_T)\gamma_m]^2\,\|(I-\Pi_{X_m})(x_0 - x^*)\|^2.
\end{align*}
„*\M|2
>



> 



> 



> 



> 



U^ m TRU Xm (z m -(x -x* 



z m ~ (x - zr*)|| ( inf 



yeY m 



z m ~ Oo - ( mf 



z m ~ (x - x*)\\ ( inf 



m m TM*y\\ 

\\T*y\\ ' 

<Il^ m TRT*y,y>, 

\\T*v\\ \\y\\n 

M||2( ^ < iri'n.T'u :-„, 2 



y&Y m 



\\T*V\\ \\vl 



z m - (xo - x*)\\ ( inf 



\\T*y\\ 2 -<(I-R)T*y,T*y> 



\\T*V\\ hi 



r 



z m -(x -x*)\\ 2 (l-c T f(mi |7 '* y|1 " 
z m -(x -x*)\\ 2 (l-c T ) 2 ri 



Thus, since $(I - \Pi_{X_m})(x_0 - x_m) = (I - \Pi_{X_m})(x_0 - x^*)$ under assumption AS,



(3.5) \\x m -x \\ < _ Cy )2 7 2 + ( 1 + (i- CT )2 M 1 - u xJ{x -x )\\. 

Inequality (3.5) holds for any $x_m \in x^* + X_m$. Over $\Omega(u)$, we know
\[ d_{\hat m} \le \frac{4r(1+L)}{r(1+L)-d}\,[d_{m_{opt}} + 1] + u - 1. \]



We then distinguish two cases according to whether $\hat m < m_{opt}$ or not.

- In the first case it is clear that $x_{\hat m} \in x^* + X_{m_{opt}}$ and $m$ in (3.5) can be replaced by $m_{opt}$.

- In the second case we have:
\[ \|(I-\Pi_{X_{\hat m}})(x_0 - x^*)\| \le \|(I-\Pi_{X_{m_{opt}}})(x_0 - x^*)\|. \]

In any case we have, over $\Omega(u) \cap \Omega_{fit}(u)$,

1 1 Xjji Xq J] 



< 11(1 - n^)(,. - x*)H 2 (l + (LJ, + (i- „)(!-<»)« 



max(C/ ) w 2 ; )) 



+ 



2(pen(m opf ) + u/n) 
(l- K )(l-c T ) 2 7 | 



+ 



(1-«)(1 -pr)2 



pen(m op4 ) + u/n 1 / 4r(l + L) XP 

[a mopt + 1J + u - 1 



7, 



m opt 



dm opt \r{l + L) - d 



< C(r) ||(/-nv )(* -z*)|| 2 + 



opt ' 



pen(m opf A , ^/ 
~ f ( 



U 



p+l 



n 
})+>



K(r, a)u p+ 



n 



< 2e~v / ^, 



which ends the proof. 

3.2. Non ordered selection. 

Ordered selection has the advantage of working directly on the observation space. It has the
disadvantage that the expansion of the solution $x_0$ over the resulting subspace $X_m$ might not
be efficient. This introduces the need for non ordered selection or, equivalently, for threshold
methods. The combination of ill-posedness and nonlinearity makes this a difficult
problem. Indeed, ill-posedness means that it is no longer possible to work on the observation
space, as this would require simultaneous control of $\gamma_m$ and $d_m$. Working on the solution space
requires considering the inverse of a certain matrix. The goodness of fit of the estimator is
then defined by the trace and spectral radius of this inverse matrix restricted to the sequence
of subspaces, which in turn depends on the degree of nonlinearity of the problem.

More precisely, let mo be such that 



This quantity can be chosen so as not to depend on the unknown regularity of the solution
$x_0$. Under assumption SC the above inequality is satisfied if the dimension of the set is such
that



Thus it is enough to choose $m_0$ such that $d_{m_0} < n^{1/2p}$. Analogous results are obtained in the
case of Hilbert scales ([?]). On the other hand, if $m_0$ is estimated as in Section 2, then
$\|(I - \Pi_{X_{m_0}})(x_0 - x^*)\|$ satisfies the optimal rates with high probability.

For this fixed $m_0$ set $A_{m_0} = T^{+}\Pi_{Y_{m_0}}$. Let $\{Y_m\}_m \subset Y_{m_0}$ be a collection of not necessarily
nested subspaces. We will use the notation $m \in m_0$ to summarize the embedding of such subsets.
Our goal is to find the best subspace along this collection using penalized estimation.
For fixed $x$, $D_m(x) = \Pi_{X_{m_0}}R(x)\Pi_{X_m}$ is a linear operator, $D_m(x) : X_m \to X_{m_0}$. Let $S_m$ be
the matrix whose entries are defined by



Set $\rho_m = \rho(S_m^tS_m)$ and $t_m = \mathrm{Tr}(S_m^tS_m)$. Let $R_m = t_m/\rho_m$. Remark that under our
assumptions, namely that the basis is orthonormal for the fixed design, both $n\rho_m$ and $nt_m$
do not depend on $n$. Introduce $L_m$, a certain weight factor, and in the notation of Lemma 5.3
define $\Sigma_i = \Sigma_i(m_0)$, $i = 1, 2$ by



||(/ - U Xm )(x - x*)|| < inf[||(7 - ILxJ(* - **)|| + 




d 2up < n iv p+ 2 p +1 . 



\[ S_m(i,j) = \sup_{x\in B_\rho(x^*)}\bigl|(A_{m_0}D_m(x))(i,j)\bigr|. \]



(3.6) 
and for any $q > 2$ and a constant $C_q$ depending on $q$
\[ \Sigma_2 = C_q\sum_{m\in m_0}\Bigl(\frac{n\rho_m\sigma^2}{d}\Bigr)^{q}\Bigl[\bigl(\tfrac{d}{2}rL_m(R_m+1)\bigr)^{q-1/2} + \bigl(\tfrac{d}{2}rL_m(R_m+1)\bigr)^{q-1}\Bigr]e^{-\frac{d}{2}rL_m(R_m+1)}. \tag{3.7} \]

As before we will consider penalized estimation. The penalty term in this case will be set to
\[ \mathrm{pen}(m) = r\sigma^2(1 + L_m)[t_m + \rho_m], \]
with $r > 2$. We now define the estimator by
\[ x_{\hat m} = x^* + \arg\min_{m\in m_0}\ \arg\min_{x_m\in X_m}\ \|A_{m_0}(y - F(x_m))\|^2 + \mathrm{pen}(m). \tag{3.8} \]

Then, we have the following result 

Theorem 3.4. Assume IG, AF and SC are satisfied. Assume (3.6) holds true. Then, if
$0 < \kappa < 1$, with probability greater than $1 - \Sigma_1e^{-(d/(2n\rho_{m_0}))u}$,
\[ \|x_{\hat m} - x_0\|^2 \le \|(I-\Pi_{X_{m_0}})(x_0 - x^*)\|^2 + \frac{2}{1 - c_T - \kappa}\,\inf_m\Bigl\{\arg\min_{x_m\in x^*+X_m}(1+c_T)\|x_m - x_0\|^2 + \mathrm{pen}(m)\Bigr\} + \sigma^2u/n \]



and for q > 1 



E[||x A -Xo|| 2 r<[||(/-n Xmo )(x -x 



*\ ||2 



inf{arg min (1 + Qr)||xo — x m \\ + pen(m)}] 9 + 
1 — ct — k m x m ex*+x m n q 

Remark 3.5. The above result depends on two factors: $t_m$ and $R_m$. However, again for fixed
$x$, we have
\[ A_{m_0}DF(x)w = A_{m_0}TR(x)w = \Pi_{X_{m_0}}R(x)w, \]
so that bounds can be obtained if we know $DF$. In order to understand the penalty term,
assume $F$ is linear, $F = T$, and let $b_j, \phi_j, \psi_j$ be its singular value decomposition (SVD),
that is, $T\phi_j = b_j\psi_j$ and $T^*\psi_j = b_j\phi_j$. Let $Y_m$ be the linear space generated by $(\phi_j)_{j\in m}$
for a collection of indices $m$. Assume the Gram matrix $G$ associated to $(\phi_j)_{j\in m_0}$ satisfies
c < p(G) < C . Then, pen(m) is roughly proportional to l/w5^ Jgm 1 jb 2 and R m is proportional 
to l/ns\ipj em l/& 2 xX]jem I n the general case, for any collection Y m , pen(m) will depend 
on

• the Gram matrix $G$,

• how operator $T$ acts along $Y_{m_0}$, that is, $\inf_{v\in Y_m}\|T^*v\|/\|v\|$,

• and on the nonlinearity of operator $F$, that is, $|t_m - \mathrm{Tr}(A^t_{m_0}\Pi_{X_m})|$.

In the complete diagonal form ($\mathrm{pen}(m) \sim \frac{1}{n}\sum_{j\in m}1/b_j^2$), the penalty entails a hard thresholding
scheme: choose a coefficient if $(x_{\hat m})_j > \sqrt{C/(b_j^2n)}$ for some constant $C$ which depends on $m_0$.
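
The hard thresholding reading of the penalty can be checked numerically in a toy diagonal setting. The sketch below is only an illustration under assumptions that are not stated in the text: the penalty is taken to be exactly additive, $\mathrm{pen}(m) = c\sum_{j\in m}1/(nb_j^2)$ with a hypothetical constant `c`, and the fit of a subset $m$ is $-\sum_{j\in m}\hat x_j^2$, as in the linear case with $x^* = 0$. Under these assumptions the best subset can be found coefficient by coefficient, and an exhaustive search over all subsets gives the same answer.

```python
import itertools
import numpy as np

def threshold_select(x_hat, b, n, c):
    """Keep index j when x_hat[j]**2 exceeds its penalty share c / (n * b[j]**2)."""
    return [j for j in range(len(x_hat)) if x_hat[j] ** 2 > c / (n * b[j] ** 2)]

def criterion(subset, x_hat, b, n, c):
    """Penalised criterion -sum_{j in m} x_hat_j^2 + pen(m), with an additive pen (assumed)."""
    fit = -sum(x_hat[j] ** 2 for j in subset)
    pen = sum(c / (n * b[j] ** 2) for j in subset)
    return fit + pen

def brute_force_select(x_hat, b, n, c):
    """Exhaustive minimisation over all subsets (only feasible in small dimension)."""
    idx = range(len(x_hat))
    subsets = itertools.chain.from_iterable(
        itertools.combinations(idx, k) for k in range(len(x_hat) + 1))
    return list(min(subsets, key=lambda m: criterion(m, x_hat, b, n, c)))

# Toy data: the singular values b_j decay, so high-index coefficients are penalised more.
rng = np.random.default_rng(0)
n, c = 200, 2.0
b = 1.0 / np.arange(1, 9)
x_true = np.array([1.0, -0.8, 0.5, 0.0, 0.3, 0.0, 0.0, 0.0])
x_hat = x_true + rng.standard_normal(8) / (np.sqrt(n) * b)  # noise amplified by 1/b_j

kept = threshold_select(x_hat, b, n, c)
print("kept indices:", kept)
```

The agreement with the exhaustive search relies entirely on the additivity of the assumed penalty; with a non-diagonal Gram matrix the selection no longer decouples across coefficients.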
Remark 3.6. In the linear case, $F = T$ and $x^* = 0$, the minimization problem in (3.8) can be
simplified, offering an important insight. Indeed, note that this problem is actually equivalent
to minimizing
\begin{align*}
x_{\hat m} &= \arg\min_m\ \arg\min_{x_m}\ \{-2\langle A_{m_0}y, A_{m_0}Tx_m\rangle + \|A_{m_0}Tx_m\|^2\} + \mathrm{pen}(m) \\
&= \arg\min_m\ \arg\min_{x_m}\ \{-2\langle \Pi_{X_m}A_{m_0}y, x_m\rangle + \|x_m\|^2\} + \mathrm{pen}(m).
\end{align*}
So that, for each $m$,
\[ x_{m,j} = \langle A_{m_0}y, e_j\rangle = \langle y, A^t_{m_0}e_j\rangle, \quad j = 1,\dots,m. \]
Thus, $m$ is selected by minimizing

7 

for $\lambda_j$ the eigenvalues of $A^t_{m_0}\Pi_{X_m}$, which is a hard thresholding scheme. Note that the
ill-conditioned $A_{m_0}$ need not be applied to the observation vector $y$.
In the nonlinear case the problem is equivalent to minimizing
\[ \arg\min_m\ \arg\min_{x_m\in x^*+X_m}\ \{-2\langle A_{m_0}(y - F(x^*)), \Pi_{X_{m_0}}R(x_m,x^*)(x_m - x^*)\rangle + \|x_m - x^*\|^2\} + \mathrm{pen}(m). \]
Set $F(x^*) = (F(x^*)(t_i))_{i=1}^n$. Hence,
\begin{align*}
(x_m - x^*)_j &= \langle A_{m_0}(y - F(x^*)), \Pi_{X_{m_0}}R(x_m,x^*)e_j\rangle \\
&= \langle y - F(x^*), A^t_{m_0}\Pi_{X_{m_0}}R(x_m,x^*)e_j\rangle, \quad j = 1,\dots,m.
\end{align*}
Then $m$ is chosen as above. However, in this case the problem must be solved numerically,
which is troublesome as $A_{m_0}$ is a badly conditioned matrix.

Proof: from the definition for any m and x m , 

\\A mo (F(x ) -F{x m ))\\ 2 < \\A mo (F(x ) -F(x m ))|| 2 + pen(m) 

+2 < e,A* mo A mo (F(x ) - F(x m )) > n +2 < e, A* mo A mo (F(x ) - F(x m )) > n -pen(m). 
We have 

A mo (F(x 1 ) - F(x 2 )) = U mo R(x 1 ,x 2 ){x 1 - x 2 ). 
Hence, the left hand side is bounded from below by (1 — CT)\\Hx m ( x m ~ x o)\\ 2 an d 
\\A mo (F(x ) - F(x m ))\\ 2 < (l + c T )||n Xmo (a; m -x )|| 2 . 

Thus, 

(i- Cr )||n Xmo (x m -x )|| 2 

< (1 + c T )\\U Xmo (x m - x )|| 2 + pen(m) 

+2 < e,A* mo R(x m ,x m ){x m - x m ) > n -pen(m). 

For any m,m' set LT m \ m / = H Xm \x ml and n mnm ' = H Xm nx m ,- With this notation 

Xm' -^mll 1 1 n^pi^fj' (x m ' ^m)!! \\^-rn\m' {x m' -^m)!! '^[^-m'\m{x "m' "^m)|| 7 
and

-\- <C £, 74 rrt0 -R(x T n) ^mjn^m (^m *^m) I 
K||x m 3^m|| "I - 2/ K||£»S' r 7 l || -I - 1/ /t|| fTiSVn || . 

The first term in the latter is bounded by 

4\\x m - (U Xmo x + x*)\\ 2 + \\x m - (U Xm( x + x*)\\ 2 ]. 
The proof then follows directly from Lemma 5.3.

4. Regularization 

Crucial questions in applying regularization methods are convergence rates and how to 
choose regularization parameters to obtain optimal convergence rates. 

Yet another approach is to consider a big enough subspace $Y_{m_0}$ and, in order to deal with
the ill-posedness of $(T^*\Pi_{Y_{m_0}})^{+}$, use Tikhonov regularization methods.

As in the last section assume that mo is such that 

II (/ " Kx mo )(x - x*)\\ < inf[|| (/ " nxj(* - x*)|| + J^r—}. 

m V n 7 m 

Next consider for a given $k \in \mathcal{K}$ a sequence $\alpha_k(n) \to 0$ as $n \to \infty$ and define the following
penalized estimator:
\[ x_{\alpha_k(n)} = x^* + \arg\min_{x\in x^*+X_{m_0}}\bigl[\|\Pi_{Y_{m_0}}(y - F(x))\|_n^2 + \alpha_k(n)\|x - x^*\|^2\bigr]. \tag{4.1} \]

We point out that choosing the smoothing sequence $\alpha_k(n)$ is the key point, since it balances
the two terms: if $\alpha_k$ is big the solution will be smooth but will not, in general, fit
the observations. On the other hand, if $\alpha_k$ is small, the solution might be too close to the
noisy observations to yield a good approximation of $x_0$. From the theory of inverse problems,
we know that it is possible to choose a regularization sequence achieving the optimal rate of
convergence, but this choice depends on $u$, which characterizes in a way the regularity of the
solution.
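
The trade-off governed by the smoothing parameter can be seen on a small linear example. The sketch below is an illustration under simplifying assumptions that are not in the text: $F = T$ is a matrix, Euclidean norms replace the empirical norm, the projection onto $Y_{m_0}$ is the identity, and the noise is Gaussian. The function name `tikhonov` and the toy operator are hypothetical.

```python
import numpy as np

def tikhonov(T, y, alpha, x_star=None):
    """Minimise ||y - T x||^2 + alpha * ||x - x_star||^2 (Euclidean norms, assumed)."""
    p = T.shape[1]
    if x_star is None:
        x_star = np.zeros(p)
    # Writing x = x_star + z, the minimiser solves (T^t T + alpha I) z = T^t (y - T x_star).
    rhs = T.T @ (y - T @ x_star)
    z = np.linalg.solve(T.T @ T + alpha * np.eye(p), rhs)
    return x_star + z

# Toy ill-posed operator: singular values decay like j^{-2}.
rng = np.random.default_rng(0)
n, p = 60, 20
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
V, _ = np.linalg.qr(rng.standard_normal((p, p)))
s = 1.0 / np.arange(1, p + 1) ** 2
T = U @ np.diag(s) @ V.T
x0 = V @ (1.0 / np.arange(1, p + 1) ** 1.5)   # a smooth unknown signal
y = T @ x0 + 0.01 * rng.standard_normal(n)    # noisy observations

alphas = [1e-10, 1e-4, 1e2]
solutions = [tikhonov(T, y, a) for a in alphas]
for a, x in zip(alphas, solutions):
    print(f"alpha={a:g}  residual={np.linalg.norm(y - T @ x):.4f}  "
          f"error={np.linalg.norm(x - x0):.4f}")
```

As the parameter grows, the residual increases and the norm of the solution shrinks; the estimation error is typically smallest for an intermediate value, which is the balance the adaptive choice aims to find from the data.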

As for the projection problem, we would like to choose $\alpha_k(n)$, among all the $\alpha_k(n)$, $k \in \mathcal{K}$,
based on the data in such a way that optimal rates are maintained. This choice must also
not depend on a priori regularity assumptions.

In the linear case, if $F = T$, for each fixed $\alpha_k(n)$ the expression (4.1) can be written in the
following way:
\[ x_{\alpha_k(n)} = x^* + \arg\min_{x\in x^*+X_{m_0}}\|(T^t\Pi_{Y_{m_0}}T + \alpha_k(n)I_{m_0})^{-1}T^t\Pi_{Y_{m_0}}(y - T(x))\|^2. \tag{4.2} \]
However, in the nonlinear case both estimators are not the same. Although in practice the
second one is more complicated (the matrix to invert might be big), in order to select $\alpha_k$
we will choose this version of the estimator, with $T$ defined in assumption AF. With this
notation set
\[ R_{\alpha_k(n)} = (T^t\Pi_{Y_{m_0}}T + \alpha_k(n)I_{m_0})^{-1}T^t\Pi_{Y_{m_0}}. \]
Now set $\gamma(x,\alpha_k) = \|R_{\alpha_k}(y - F(x))\|^2$ and
\[ \mathrm{pen}(\alpha_k) = r\sigma^2(1 + L_k)\bigl(\mathrm{Tr}(R^t_{\alpha_k}R_{\alpha_k}) + \rho^2(R_{\alpha_k})\bigr), \]
with $r > 2$.
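
To make the ingredients of this penalty concrete, the following sketch computes the regularized inverse and the penalty on a grid of smoothing parameters. It is an illustration under assumptions that are not in the text: $\rho(R)$ is read as the largest singular value of $R$, the weight $L_k$ is a fixed hypothetical constant `L`, and the projector onto $Y_{m_0}$ defaults to the identity.

```python
import numpy as np

def regularised_inverse(T, alpha, P=None):
    """R_alpha = (T^t P T + alpha I)^{-1} T^t P, with P the projector onto Y_{m0}."""
    n, p = T.shape
    if P is None:
        P = np.eye(n)  # assumption: Y_{m0} is the whole observation space
    return np.linalg.solve(T.T @ P @ T + alpha * np.eye(p), T.T @ P)

def penalty(R, r, sigma2, L):
    """r * sigma^2 * (1 + L) * (Tr(R^t R) + rho^2(R)), rho read as the top singular value."""
    sv = np.linalg.svd(R, compute_uv=False)      # sorted in decreasing order
    return r * sigma2 * (1.0 + L) * (np.sum(sv ** 2) + sv[0] ** 2)

# Diagonal toy operator, so that R_alpha has singular values s / (s^2 + alpha).
s = np.array([1.0, 0.5, 0.1])
T = np.diag(s)
r, sigma2, L = 2.5, 0.01, 1.0
alphas = [1e-3, 1e-2, 1e-1, 1.0]
pens = [penalty(regularised_inverse(T, a), r, sigma2, L) for a in alphas]
for a, p_val in zip(alphas, pens):
    print(f"alpha={a:g}  pen={p_val:.4f}")
```

In this toy case the penalty decreases as the smoothing parameter grows, since every singular value $s/(s^2+\alpha)$ of the regularized inverse shrinks; the penalty therefore charges weak regularization for the noise it lets through.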

We choose $x_{\hat\alpha}$ such that
\[ x_{\hat\alpha} = x^* + \arg\min\bigl(\gamma(x,\alpha_k(n)) + \mathrm{pen}(\alpha_k(n))\bigr). \]
Let $x_{\alpha_k} = x^* + R_{\alpha_k}T(x_0 - x^*)$.
Set 



S(^) = E 2 



IdTriRtR 



+ 1 



j c - A /dL k [TV (Rj h R a h ) + P 2 (flq J] / P 2 {R a ' ) 

p 2 (nR ak ) 



for $d$ as in Lemma 5.3.

We have the following result, 



Theorem 4.1. For any $x \in x^* + X_{m_0}$ and any $k$ such that $d_{m_0} > \alpha_k^{-1/(2p)}$, the following
inequality holds true

(4.3) E||n Xmo (x a .-x )|| 2 < — L_ inf [C(l+ CT )||n Xmo (x afc -Xo)|| 2 + 2pen(a fc )] + -^-, 

Hence, the estimator is optimal in the sense that the adaptive estimator achieves the best 
rate of convergence among all the regularized estimators. 

Remark 4.2. The condition $d_{m_0} > \alpha_k^{-1/(2p)}$ follows from the requirement that

Tr (K k R* k ) n( -i/(2p)n 
P 2 (^J 1 k } ' 

Proof. There exists a constant $C$ such that for any $k$,
\[ (1 - c_T)\|\Pi_{X_{m_0}}(x_1 - x_2)\|^2 \le \|R_{\alpha_k}(F(x_1) - F(x_2))\|_n^2 \le C(1 + c_T)\|\Pi_{X_{m_0}}(x_1 - x_2)\|^2. \]
Thus, for any $x \in x^* + X_{m_0}$,
\[ (1 - c_T)\|\Pi_{X_{m_0}}(x_{\hat\alpha} - x_0)\|^2 \le C(1 + c_T)\|\Pi_{X_{m_0}}(x - x_0)\|^2 + 2\,\mathrm{pen}(\alpha_k) + 2\sup_k\bigl[\|R_{\alpha_k}\varepsilon\|^2 - \mathrm{pen}(\alpha_k)\bigr]. \]
The result follows directly from Lemma 5.3.

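In the linear case $F = T$, we get the following proof: For any $x_{\alpha_k}$ and any $k \in \mathbb{N}$,
\[ \|R_{\hat\alpha}(y - Tx_{\hat\alpha})\|^2 + \mathrm{pen}(\hat\alpha) \le \|R_{\alpha_k}(y - Tx_{\alpha_k})\|^2 + \mathrm{pen}(\alpha_k) \]
and
\[ \|R_{\alpha_k}(y - Tx_{\alpha_k})\|^2 = \|R_{\alpha_k}T(x_0 - x_{\alpha_k})\|^2 + 2\langle R_{\alpha_k}T(x_0 - x_{\alpha_k}), R_{\alpha_k}\varepsilon\rangle + \|R_{\alpha_k}\varepsilon\|^2. \]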



Thus, following standard arguments we have 
\\R ajs T(x - x ajt )\\ 2
< \\R ak T(x -x ak ) || 2 -2 <R at T(x -x af .),R ai e> 

M 2 i ll d 1 1 2 

k " 



+2 < R ak T(x - x ak ), R ak e > -\\R a .e\\ + \\R ak £\\ + pen(a; fc ) + pen(c^). 
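Let $0 < \kappa < 1$. Since $2ab \le \kappa a^2 + \frac{1}{\kappa}b^2$ for any $a, b$, we have for any $k$ and $x_{\alpha_k} \in X_{m_0}$
\[ (1 - \kappa)\|R_{\hat\alpha}T(x_0 - x_{\hat\alpha})\|^2 \le (1 + \kappa)\|R_{\alpha_k}T(x_0 - x_{\alpha_k})\|^2 + 2\,\mathrm{pen}(\alpha_k) + 2\sup_k\Bigl\{\frac{1}{\kappa}\|R_{\alpha_k}\varepsilon\|^2 - \mathrm{pen}(\alpha_k)\Bigr\}. \]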

On the other hand, using that is 1 < R ak T < C, we have that for any x ak G X mo and any 
fcG IN, 



(1 - «)||a; - x Q .|| 2 < C(l - w)||a; -ac a J 2 
2pen(a fc ) + 2C X sup {||i? afe £|| 2 - pen(a fc )}. 



k 

As above, the proof then follows directly from lemma HTSl which characterizes the supremum of 
the empirical process under the linear application as defined by the regularization family. □ 

5. Appendix 

In this section we give some technical lemmas. The next lemma characterizes the supremum 
of an empirical process by the norm of an orthogonal projection. 

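Lemma 5.1.
\[ \sup_{y\in Y_m,\ \|y\|_n=1}|\langle\varepsilon, y\rangle_n| = \|\Pi_{Y_m}\varepsilon\|_n. \tag{5.1} \]

Proof. Using the definition of an orthogonal projector, we have
\[ \Bigl\|\varepsilon - \frac{1}{\|\Pi_{Y_m}\varepsilon\|_n}\Pi_{Y_m}\varepsilon\Bigr\|_n^2 = \min_{y\in Y_m,\ \|y\|_n=1}\|\varepsilon - y\|_n^2. \]
As a consequence we can write:
\begin{align*}
\|\varepsilon\|_n^2 - 2\Bigl\langle\varepsilon, \frac{1}{\|\Pi_{Y_m}\varepsilon\|_n}\Pi_{Y_m}\varepsilon\Bigr\rangle_n + \frac{1}{\|\Pi_{Y_m}\varepsilon\|_n^2}\|\Pi_{Y_m}\varepsilon\|_n^2 &= \min_{y\in Y_m,\ \|y\|_n=1}\|\varepsilon\|_n^2 - 2\langle\varepsilon, y\rangle_n + 1, \\
2\Bigl\langle\varepsilon - \Pi_{Y_m}\varepsilon, \frac{1}{\|\Pi_{Y_m}\varepsilon\|_n}\Pi_{Y_m}\varepsilon\Bigr\rangle_n + 2\Bigl\langle\Pi_{Y_m}\varepsilon, \frac{1}{\|\Pi_{Y_m}\varepsilon\|_n}\Pi_{Y_m}\varepsilon\Bigr\rangle_n &= 2\sup_{y\in Y_m,\ \|y\|_n=1}|\langle\varepsilon, y\rangle_n|, \\
\|\Pi_{Y_m}\varepsilon\|_n &= \sup_{y\in Y_m,\ \|y\|_n=1}|\langle\varepsilon, y\rangle_n|,
\end{align*}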

which ends the proof. □ 
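
The identity (5.1) is easy to check numerically. The sketch below is an illustration with hypothetical names; it assumes the empirical inner product $\langle u, v\rangle_n = \frac{1}{n}\sum_i u_iv_i$ and takes $Y_m$ to be the column span of a random matrix `basis`. It compares the norm of the projection with the value attained at the normalized projection and with a crude random search over the unit sphere of $Y_m$.

```python
import numpy as np

def empirical_inner(u, v):
    """Empirical inner product <u, v>_n = (1/n) * sum_i u_i v_i (assumed convention)."""
    return float(np.dot(u, v)) / len(u)

def empirical_norm(u):
    return np.sqrt(empirical_inner(u, u))

def project(basis, eps):
    """Orthogonal projection of eps onto the column span of `basis` (least squares)."""
    coef, *_ = np.linalg.lstsq(basis, eps, rcond=None)
    return basis @ coef

rng = np.random.default_rng(1)
n, d = 50, 3
basis = rng.standard_normal((n, d))   # columns span Y_m
eps = rng.standard_normal(n)          # noise vector

proj_norm = empirical_norm(project(basis, eps))

# The supremum is attained at the normalised projection of eps.
y_star = project(basis, eps) / proj_norm
attained = abs(empirical_inner(eps, y_star))

# Random unit-norm elements of Y_m give values that stay below the projection norm.
Y = rng.standard_normal((2000, d)) @ basis.T
Y /= np.sqrt(np.mean(Y ** 2, axis=1, keepdims=True))
random_best = float(np.max(np.abs(Y @ eps) / n))

print(proj_norm, attained, random_best)
```

The random search only provides a lower bound on the supremum, which is why it is paired with the explicit maximizer.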
The next result is a deviation inequality based on a functional exponential inequality (Theorem 7.4)
due to [?] 2003. Set $\eta(A) = \sup_{\|u\|=1}\sum_{i=1}^n\varepsilon_i(A^tu)_i$ for $A : \mathbb{R}^n \to \mathbb{R}^k$. Let



Then, 

Lemma 5.2. 



n 



^ IMI=i p(A'A) (7 



> IE - \ i - + v 7 ^ + x) < e _x . 



VpV2(A*A) ap 1 / 2 (A t A) 

Proof. Since the application $u \to A^tu$ is continuous, we have $\eta(A) = \sup_{u\in S}\sum_{i=1}^n\varepsilon_i(A^tu)_i$ for
$S$ some countable subset of the unit ball. On the other hand,



sup [-A*u]j < sup < p(A). 

||«||=1 ||u||=l 

Thus sup|| u || <:L \(A t u) i /p 1 / 2 (A t A)\ < 1. Also, following the proof of Corollary 5.1 in [?] 

{A'u) 2 
l5=i P(^) 

" ,1^1,=! p(A*A) 

< SU P T77^ 

||u||=l P(^) 

:= z { . 

Set Z = Z(£i, . . . ,e n ) = r)(A)/(ap 1 / 2 (A t A)). Let TEj stand for the conditional expectation 
given £j for i ^ j. Hence, in the proof of Theorem 7.4 in [?] we may bound 

\ P .\P (A^) 2 ( A t iA 2 

J " o-p M ^ lP {A*A) M ^ i y p(A t A) J " V|J|/ 3 

Thus, E|Z - EjZ|p < z jP \/2. Finally, note that 
Thus, the proof follows from Theorem 7.4 in [?]. 

□ 
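
The quantity $\eta(A)$ has a simple closed form that helps when reading the bounds of this section: by the Cauchy-Schwarz inequality, $\eta(A) = \|A\varepsilon\|$, so that $\mathbb{E}\,\eta^2(A) = \sigma^2\mathrm{Tr}(A^tA)$ for centered i.i.d. noise with variance $\sigma^2$. The sketch below checks both facts numerically; the Gaussian noise, the dimensions and the variable names are assumptions made only for the illustration.

```python
import numpy as np

def eta(A, eps):
    """eta(A) = sup_{||u||=1} sum_i eps_i (A^t u)_i, equal to ||A eps|| by Cauchy-Schwarz."""
    return float(np.linalg.norm(A @ eps))

rng = np.random.default_rng(0)
k, n, sigma = 3, 20, 1.5
A = rng.standard_normal((k, n))        # A maps R^n to R^k
eps = sigma * rng.standard_normal(n)   # Gaussian noise (assumption)

# The supremum is attained at u = A eps / ||A eps||.
u_star = A @ eps / np.linalg.norm(A @ eps)
attained = float(eps @ (A.T @ u_star))

# Random unit vectors u never exceed eta(A).
U = rng.standard_normal((5000, k))
U /= np.linalg.norm(U, axis=1, keepdims=True)
random_best = float(np.max(U @ (A @ eps)))

# Monte Carlo estimate of E[eta^2(A)], compared with sigma^2 * Tr(A^t A).
draws = sigma * rng.standard_normal((20000, n))
mc_mean = float(np.mean(np.sum((draws @ A.T) ** 2, axis=1)))
trace_term = sigma ** 2 * np.trace(A.T @ A)

# Spectral radius of A^t A, which controls the worst-case size of eta^2(A).
rho = float(np.linalg.eigvalsh(A.T @ A)[-1])
print(mc_mean, trace_term, rho)
```

This is why the trace drives the mean of $\eta^2(A)$ while the spectral radius $\rho(A^tA)$ drives its fluctuations in the deviation bounds.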

As a corollary, we have the following lemma

Lemma 5.3. • There exists a positive constant $d$ that depends on $r/2$ such that the
following inequality holds
\[ P\bigl(\eta^2(A) \ge \sigma^2[\mathrm{Tr}(A^tA) + \rho(A^tA)]\,r/2\,(1 + L) + \sigma^2u\bigr) \le \exp\bigl\{-d\bigl(u/\rho(A^tA) + r/2\,L[\mathrm{Tr}(A^tA)/\rho(A^tA) + 1]\bigr)\bigr\}. \tag{5.2} \]

• Set $k_1 = d/(\rho(A^tA)\sigma^2)$ and $k_2 = d\,r/2\,L[\mathrm{Tr}(A^tA)/\rho(A^tA) + 1]$. Then, there exists a
constant $C_q$, which depends only on $q$, such that
\[ \mathbb{E}\bigl[\eta^2(A) - \sigma^2[\mathrm{Tr}(A^tA) + \rho(A^tA)]\,r/2\,(1 + L)\bigr]_+^q \le C_qk_1^{-q}\bigl[k_2^{q-1/2} + k_2^{q-1}\bigr]e^{-k_2} \tag{5.3} \]
holds.



Proof. As a first step we will bound $v$. Since $\mathbb{E}Z \le \mathbb{E}^{1/2}Z^2$, we have



<E^>(^) 2 + 2 



i=i 



v eY;^(-) 2 < (1 + u)JEJ2^-) 2 + Tr(A t A)/p(A t A). 

\ i=l 



i=l 



Moreover, following [?], p. 480, for all $p > 2$, the following version of Rosenthal's inequality
holds:



W/ 2 ^ Zi { £ -±) 2 < 2 p/2 Tr(A t A)/p(A t A)lE. 



Ifif 
o-P 



Hence, we have 



and 



v < (1 + i/)Tr(A*A)/p(A*A) + 



v 



v 2 < 2[2 2 (1 + v) 2 Tr(A t A)/p(A t A)lE^ + (~) 2 ] 

a 4 v 



Set < a < 1. Choose 5 and (3 such that if 



2 2 4!5 2 (l + l/a)(l-z/) 2 <ci, 
25 2 (l + l/a)(i) 2 <c 2 



and $c = \max((1 + \beta)\max(c_1, c_2), (1 + a))$, then $r/2 > c$. Let $u > 0$ and, without loss of
generality, assume $\sigma = 1$. Thus,

P(^ 2 (A) > (Tr(AM) + p(A*A))r/2(l + L) + u) 

< P(r] 2 (A) > {Tr{A l A){l + a) + (1 + l/a)<?Vp(4U))(l + /3) 

+ [r/2 - c](7V(AU) + <SVp(A*4)) + r/2L(Tr(A'A) + p(A*A)) + «) 

< P{rf{A) > (Tr(AM)(l + a) + (1 + l/a)5Vp(A'A))(l + [3) 
+r/2L(Tr(A t A) + p(A*A)) + u) 



Set 



x 



v f3> 



r Tr(A l A) 
2L^ p(A t A) 



+ 1) + 



p(A*A) 
The last term is equal to



, rf(A) / Tr(A t A) 
>(A*A) v p(A t A) v 7 

+ (1 + l/a)^ 2 )(l + f}) + r/2L^^ + 1) + u) 

= P{ £vA) " + a) + (1 + V^Xl + /3) + (1 + 

Finally, we may bound 

p x/ (A*A) p 1 / 2 (A t A) 

where we have used repeatedly that for any constant $c > 0$, $ca^2 + \frac{1}{c}b^2 \ge 2ab$ and set

(/(A) = ((1 + V/T'a + 2/5f){r/2L[Tr{A t A)/p{A t A) + 1] + u/p(^)). 

Set also d = [(1 + + 2/5) 2 ]- 1 and 6(A) = 7V(AU)/p(AU). Thus we have shown the 

first part of the lemma. 

Moreover, using the above inequality, 

E[^ 2 (A) - <7 2 (Tr(A*A) + p{A t A))r /2{l + L)] 9 + 

/•oo 

< / a ^ qu 1-^ e -^dr/2L[b{A)+l]+du/{p(A^A)) du _ 



Consider the change of variable w = du/(p(A t A)) + dr/2L[b(A) + 1], so that 
E[^ 2 (A) - a 2 (Tr(A t A) + p{A t A))r /2{l + L)]« 

< f a2p (f A A Q (w- dr/2L[b(A) + l^e-^dw. 

\ d ) Jdr/2L[b(A)+l] 

The last expression is in turn bounded by 

/ o- 2 p(;f A) \ Q e-^K" 1 + (dr/2L[6(A) + l]) 9 " 1 ]^ 

\ d J Jdr/2L[b(A)+l] 
< (1 k~ q \h q ~ l l 2 4- k q ~ X \p~^ 

Ly^/v-j [ 2 ^ 2 J ? 

ending the proof. 

□ 




The second author would like to thank ECOS Nord and Agenda Petroleo de Venezuela for
supporting her work.

CNRS - Laboratoire de Mathematiques, UMR 8628, Universite Paris-Sud, Bat 425, 91405 
Orsay cedex France 

Departamento de Matematicas, IVIC, Venezuela 

E-mail address: Jean-Michel.Loubes@math.u-psud.fr 
E-mail address: cludena@ivic.ve 



